graph TB Client[Client Application] --> Prompter Prompter --> RequestProcessor Prompter --> Dataset RequestProcessor --> LLMProvider[LLM Provider] Dataset --> Storage[File Storage] RequestProcessor --> Storage subgraph Core Components Prompter[Prompter] RequestProcessor[Request Processor] Dataset[Dataset Manager] end
Description: The high-level architecture shows the main components of Curator:
sequenceDiagram participant Client participant Prompter participant PromptFormatter participant RequestProcessor participant MetadataDB Client->>Prompter: Initialize(model_name, prompt_func, parse_func) Prompter->>PromptFormatter: Create formatter Prompter->>RequestProcessor: Initialize processor Prompter->>MetadataDB: Store configuration MetadataDB-->>Prompter: Configuration stored Prompter-->>Client: Ready for processing
Description: The initialization process involves:
graph LR Input[Input Dataset] -->|Load| Prompter Prompter -->|Format| GenericRequest GenericRequest -->|Process| RequestProcessor RequestProcessor -->|Call API| LLMProvider LLMProvider -->|Response| GenericResponse GenericResponse -->|Parse| Dataset Dataset -->|Save| Storage subgraph Processing Pipeline GenericRequest RequestProcessor GenericResponse end
Description: The data processing pipeline shows how data flows through the system:
graph TB Dataset -->|1| BatchProcessor BatchProcessor -->|2| RequestQueue[Request Queue] subgraph Workers RequestQueue -->|3| Worker1[Worker 1] RequestQueue -->|3| Worker2[Worker 2] RequestQueue -->|3| WorkerN[Worker N] end subgraph Rate Limiting Worker1 --> RateLimiter Worker2 --> RateLimiter WorkerN --> RateLimiter end RateLimiter -->|4| LLMProvider LLMProvider -->|5| ResponseCollector ResponseCollector -->|6| Storage
Description: The batch processing system:
sequenceDiagram participant Client participant OnlineProcessor participant RateLimiter participant LLMProvider participant ErrorHandler Client->>OnlineProcessor: Submit request OnlineProcessor->>RateLimiter: Check capacity alt Has capacity RateLimiter->>OnlineProcessor: Allow request OnlineProcessor->>LLMProvider: Make API call alt Success LLMProvider-->>OnlineProcessor: Return response OnlineProcessor-->>Client: Return result else Error LLMProvider-->>ErrorHandler: Handle error ErrorHandler->>OnlineProcessor: Retry or fail end else No capacity RateLimiter->>OnlineProcessor: Throttle OnlineProcessor->>OnlineProcessor: Wait end
Description: Online processing handles real-time requests:
graph TD RequestProcessor -->|Requests| RequestsJSONL[requests_*.jsonl] RequestProcessor -->|Responses| ResponsesJSONL[responses_*.jsonl] subgraph Persistent Storage RequestsJSONL --> ArrowDataset[dataset.arrow] ResponsesJSONL --> ArrowDataset ArrowDataset --> HuggingFace[HuggingFace Dataset] end subgraph Metadata Storage MetadataDB[(SQLite DB)] MetadataDB -->|Run History| RunTracking[Run Tracking] MetadataDB -->|Dataset Hash| Reproducibility end
Description: The storage system provides:
stateDiagram-v2 [*] --> RequestCreation: Create Request RequestCreation --> RateCheck: Check Rate Limits RateCheck --> APICall: Capacity Available RateCheck --> Throttled: No Capacity Throttled --> CoolDown: Wait CoolDown --> RateCheck: Retry APICall --> Success: 200 OK APICall --> RateLimit: 429 Rate Limited APICall --> APIError: Other Error RateLimit --> ExponentialBackoff: Wait 15s ExponentialBackoff --> RateCheck: Retry APIError --> RetryCheck: Check Attempts Left RetryCheck --> APICall: Retry Available RetryCheck --> Failed: No Retries Left Success --> ResponseValidation: Parse Response ResponseValidation --> [*]: Valid ResponseValidation --> Failed: Invalid Failed --> ErrorLogging: Log Error ErrorLogging --> [*]: Complete
Description: The error handling system implements:
graph TB Curator[Curator Library] --> External[External Systems] subgraph External Systems HuggingFace[HuggingFace] OpenAI[OpenAI API] Anthropic[Anthropic API] Custom[Custom LLMs] end subgraph Integration Layer Adapters[Provider Adapters] DataConverters[Data Converters] APIClients[API Clients] end Curator --> Adapters Adapters --> External
Description: Integration capabilities include: